- Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Federated Learning With L0 Constraint Via Probabilistic Gates For Sparsity
Huthasana, Krishna Harsha Kovelakuntla, Olama, Alireza, Lundell, Andreas
Federated Learning (FL) is a distributed machine learning setting that requires multiple clients to collaborate on training a model while maintaining data privacy. Leaving the inherent sparsity in data and models unaddressed often results in overly dense models and poor generalizability under data and client participation heterogeneity. We propose FL with an L0 constraint on the density of non-zero parameters, achieved through a reparameterization using probabilistic gates and their continuous relaxation, originally proposed for sparsity in centralized machine learning. We show that the objective for L0 constrained stochastic minimization naturally arises from an entropy maximization problem of the stochastic gates and propose an algorithm based on federated stochastic gradient descent for distributed learning. We demonstrate that the target density (rho) of parameters can be achieved in FL, under data and client participation heterogeneity, with minimal loss in statistical performance for linear and non-linear models: Linear regression (LR), Logistic regression (LG), Softmax multi-class classification (MC), Multi-label classification with logistic units (MLC), and Convolutional Neural Network (CNN) for multi-class classification (MC). We compare the results with a magnitude pruning-based thresholding algorithm for sparsity in FL. Experiments on synthetic data with target density down to rho = 0.05 and on the publicly available RCV1, MNIST, and EMNIST datasets with target density down to rho = 0.005 demonstrate that our approach is communication-efficient and consistently better in statistical performance.
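As a rough illustration of a probabilistic gate with a continuous relaxation, the sketch below uses the hard-concrete gate that is common in centralized L0 sparsification. The abstract does not name the relaxation, so the choice of hard-concrete, the constants `GAMMA`, `ZETA`, `BETA`, and all function names are assumptions rather than the authors' implementation. It shows how a gate is sampled and how the expected density of non-zero gates, the quantity a target rho would constrain, can be computed in closed form.

```python
import math
import random

# Stretch limits and temperature; assumed values, not taken from the paper.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_gate(log_alpha, rng):
    """Draw one relaxed gate value in [0, 1] for a parameter with logit log_alpha."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA  # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, stretched))    # clamp so exact 0 and 1 have mass


def prob_nonzero(log_alpha):
    """Closed-form probability that the gate is non-zero."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def expected_density(log_alphas):
    """Expected fraction of non-zero gates; compare this against a target rho."""
    return sum(prob_nonzero(a) for a in log_alphas) / len(log_alphas)
```

In a constrained formulation, `expected_density` would be compared with the target rho during training; how the constraint is enforced across clients is specific to the paper and is not reproduced here.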
- North America > United States > Virginia (0.04)
- Europe > Finland (0.04)
- Asia (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.54)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
On Using Large-Batches in Federated Learning
Abstract: Efficient federated learning (FL) is crucial for training deep networks over devices with limited compute resources and bounded networks. With the advent of big data, devices either generate or collect multimodal data to train either generic or local-context aware networks, particularly when data privacy and locality are vital. Under frequent synchronization settings, FL over a large cluster of devices may perform more work per training iteration by processing a larger global batch size, thus attaining considerable training speedup. However, this may result in poor test performance (i.e., high test loss or low accuracy) due to generalization degradation issues associated with large-batch training. To address these challenges with large batches, this work presents our vision of exploiting the trade-offs between small- and large-batch training, and explores new directions to enjoy both the parallel scaling of large batches and the good generalizability of small-batch training. For the same number of iterations, we observe that our proposed large-batch training technique attains about 32.33% and 3.74% higher test accuracy than small-batch training in ResNet50 and VGG11 models respectively. Collaborative or federated learning (FL) methods are optimized to perform on-device training when clients are resource-constrained [22], [23], communication latency and bandwidth are bounded [3], and data privacy or locality is paramount [1], [24].
- North America > United States > California > Santa Clara County > Palo Alto (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Law (0.68)
- Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
OmniLearn: A Framework for Distributed Deep Learning over Heterogeneous Clusters
Deep learning systems are optimized for clusters with homogeneous resources. However, heterogeneity is prevalent in computing infrastructure across edge, cloud and HPC. When training neural networks using stochastic gradient descent techniques on heterogeneous resources, performance degrades due to stragglers and stale updates. In this work, we develop an adaptive batch-scaling framework called OmniLearn to mitigate the effects of heterogeneity in distributed training. Our approach is inspired by proportional controllers to balance computation across heterogeneous servers, and works under varying resource availability. By dynamically adjusting worker mini-batches at runtime, OmniLearn reduces training time by 14-85%. We also investigate asynchronous training, where our techniques improve accuracy by up to 6.9%.
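To make the idea of proportional-control batch scaling concrete, here is a minimal sketch. It is not OmniLearn's control law: the function name `rebalance`, the `gain` parameter, and the rule of rescaling so the global batch stays roughly constant are all assumptions made for illustration. Each worker's mini-batch is nudged in proportion to how far its measured iteration time is from the cluster mean, so slower workers receive less work.

```python
def rebalance(batch_sizes, iter_times, gain=0.5, min_batch=1):
    """One proportional-control step over per-worker mini-batch sizes.

    batch_sizes: current mini-batch size per worker
    iter_times:  measured time per iteration for each worker (same order)
    gain:        proportional gain (hypothetical tuning knob)
    """
    target = sum(iter_times) / len(iter_times)  # set point: mean iteration time
    total = sum(batch_sizes)

    # Error is positive for fast workers (grow batch), negative for stragglers.
    proposed = [
        max(float(min_batch), b * (1.0 + gain * (target - t) / target))
        for b, t in zip(batch_sizes, iter_times)
    ]

    # Rescale so the global batch size stays approximately unchanged.
    scale = total / sum(proposed)
    return [max(min_batch, round(p * scale)) for p in proposed]
```

For example, `rebalance([64, 64], [1.0, 3.0])` shifts work from the slower second worker to the faster first one while keeping the sum at 128. Because of rounding, the global batch is only approximately preserved in general.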
- North America > United States > Indiana (0.04)
- North America > United States > Massachusetts (0.04)
- Europe > Czechia > Prague (0.04)
- Education (0.46)
- Information Technology (0.46)
Scalable Non-linear Learning with Adaptive Polynomial Expansions
Alekh Agarwal, Alina Beygelzimer, Daniel J. Hsu, John Langford, Matus J. Telgarsky
Can we effectively learn a nonlinear representation in time comparable to linear learning? We describe a new algorithm that explicitly and adaptively expands higher-order interaction features over base linear representations. The algorithm is designed for extreme computational efficiency, and an extensive experimental study shows that its computation/prediction tradeoff ability compares very favorably against strong baselines.
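The sketch below illustrates what adaptively expanding interaction features over a linear model can look like. It is a simplified stand-in, not the paper's algorithm: selecting parents by the largest absolute linear weights, the parameter `k`, and the helper names are assumptions for illustration only.

```python
from itertools import combinations


def expand_features(names, weights, k=2):
    """Pick the k base features with the largest |weight| and pair them up."""
    ranked = sorted(range(len(names)), key=lambda i: abs(weights[i]), reverse=True)[:k]
    parents = sorted(ranked)  # keep a stable order for the generated pairs
    return [(names[i], names[j]) for i, j in combinations(parents, 2)]


def apply_expansion(x, pairs):
    """Add one product feature per selected pair to a {name: value} example."""
    out = dict(x)
    for a, b in pairs:
        out[a + "*" + b] = x.get(a, 0.0) * x.get(b, 0.0)
    return out
```

Repeating the two steps (fit a linear model, expand around its strongest features, refit) grows higher-order terms only where the current model suggests they matter, which keeps the feature count far below a full polynomial expansion.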